What, When and Where of petitions submitted to the UK Government during a time of chaos
In times marked by political turbulence and uncertainty, as well as
increasing divisiveness and hyperpartisanship, governments need to use every
tool at their disposal to understand and respond to the concerns of their
citizens. We study issues raised by the UK public to the Government during
2015-2017 (surrounding the UK EU-membership referendum), mining public opinion
from a dataset of 10,950 petitions (representing 30.5 million signatures). We
extract the main issues with a ground-up natural language processing (NLP)
method, latent Dirichlet allocation (LDA). We then investigate their temporal
dynamics and geographic features. We show that whilst the popularity of some
issues is stable across the two years, others are highly influenced by external
events, such as the referendum in June 2016. We also study the relationship
between petitions' issues and where their signatories are geographically
located. We show that some issues receive support from across the whole country
but others are far more local. We then identify six distinct clusters of
constituencies based on the issues which constituents sign. Finally, we
validate our approach by comparing the petitions' issues with the top issues
reported in Ipsos MORI survey data. These results show the huge power of
computationally analyzing petitions to understand not only what issues citizens
are concerned about, but also when and from where.
Comment: Preprint; under review
Directions in abusive language training data, a systematic review: Garbage in, garbage out
Data-driven and machine learning based approaches for detecting, categorising and measuring abusive content such as hate speech and harassment have gained traction due to their scalability, robustness and increasingly high performance. Making effective detection systems for abusive content relies on having the right training datasets, reflecting a widely accepted mantra in computer science: Garbage In, Garbage Out. However, creating training datasets which are large, varied, theoretically-informed and that minimize biases is difficult, laborious and requires deep expertise. This paper systematically reviews 63 publicly available training datasets which have been created to train abusive language classifiers. It also reports on the creation of a dedicated website for cataloguing abusive language data, hatespeechdata.com. We discuss the challenges and opportunities of open science in this field, and argue that although more dataset sharing would bring many benefits it also poses social and ethical risks which need careful consideration. Finally, we provide evidence-based recommendations for practitioners creating new abusive content training datasets.
Islamophobes are not all the same! A study of far right actors on Twitter
Far-right actors are often purveyors of Islamophobic hate speech online,
using social media to spread divisive and prejudiced messages. Hateful content
can inflict harm on targeted victims, create a sense of fear amongst
communities, and stir up intergroup tensions and conflict. Accordingly, there
is a pressing need to better
understand at a granular level how Islamophobia manifests online and who
produces it. We investigate the dynamics of Islamophobia amongst followers of a
prominent UK far right political party on Twitter, the British National Party.
Analysing a new data set of five million tweets, collected over a period of one
year, using a machine learning classifier and latent Markov modelling, we
identify seven types of Islamophobic far right actors, capturing qualitative,
quantitative and temporal differences in their behaviour. Notably, we show that
a small number of users are responsible for most of the Islamophobia that we
observe. We then discuss the policy implications of this typology in the
context of social media regulation.
Two Contrasting Data Annotation Paradigms for Subjective NLP Tasks
Labelled data is the foundation of most natural language processing tasks.
However, labelling data is difficult, and there are often diverse, valid beliefs
about what the correct data labels should be. So far, dataset creators have
acknowledged annotator subjectivity, but rarely actively managed it in the
annotation process. This has led to partly-subjective datasets that fail to
serve a clear downstream use. To address this issue, we propose two contrasting
paradigms for data annotation. The descriptive paradigm encourages annotator
subjectivity, whereas the prescriptive paradigm discourages it. Descriptive
annotation allows for the surveying and modelling of different beliefs, whereas
prescriptive annotation enables the training of models that consistently apply
one belief. We discuss benefits and challenges in implementing both paradigms,
and argue that dataset creators should explicitly aim for one or the other to
facilitate the intended use of their dataset. Lastly, we conduct an annotation
experiment using hate speech data that illustrates the contrast between the two
paradigms.
Comment: Accepted at NAACL 2022 (Main Conference)
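The contrast between the two paradigms shows up concretely in how annotator labels are aggregated: a prescriptive pipeline collapses annotations into a single adjudicated label, whereas a descriptive pipeline preserves the full distribution of beliefs. A minimal sketch follows; the toy labels and helper names are illustrative assumptions, not the paper's data or code.

```python
from collections import Counter

# Toy annotations for one item from five annotators.
# Illustrative only; not data from the paper.
annotations = ["hate", "hate", "not_hate", "hate", "not_hate"]

def prescriptive_label(labels):
    """Prescriptive paradigm: enforce one belief, e.g. by majority
    vote backed by detailed annotation guidelines."""
    return Counter(labels).most_common(1)[0][0]

def descriptive_distribution(labels):
    """Descriptive paradigm: keep the distribution of beliefs,
    usable as a soft training target for a model."""
    counts = Counter(labels)
    return {label: n / len(labels) for label, n in counts.items()}

print(prescriptive_label(annotations))        # -> hate
print(descriptive_distribution(annotations))  # -> {'hate': 0.6, 'not_hate': 0.4}
```

The design choice the paper argues for is making this decision explicit: training on the majority label discards the disagreement signal, while training on the distribution models it.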
The Empty Signifier Problem: Towards Clearer Paradigms for Operationalising "Alignment" in Large Language Models
In this paper, we address the concept of "alignment" in large language models
(LLMs) through the lens of post-structuralist socio-political theory,
specifically examining its parallels to empty signifiers. To establish a shared
vocabulary around how abstract concepts of alignment are operationalised in
empirical datasets, we propose a framework that demarcates: 1) which dimensions
of model behaviour are considered important, then 2) how meanings and
definitions are ascribed to these dimensions, and by whom. We situate existing
empirical literature and provide guidance on deciding which paradigm to follow.
Through this framework, we aim to foster a culture of transparency and critical
evaluation, aiding the community in navigating the complexities of aligning
LLMs with human populations.
Comment: Socially Responsible Language Modelling Research (SoLaR) @ NeurIPS 2023
Understanding RT’s Audiences: Exposure Not Endorsement for Twitter Followers of Russian State-Sponsored Media
The Russian state-funded international broadcaster RT (formerly Russia Today) has attracted much attention as a purveyor of Russian propaganda. To date, most studies of RT have focused on its broadcast, website, and social media content, with little research on its audiences. Through a data-driven application of network science and other computational methods, we address this gap to provide insight into the demographics and interests of RT’s Twitter followers, as well as how they engage with RT. Building upon recent studies of Russian state-sponsored media, we report three main results. First, we find that most of RT’s Twitter followers only very rarely engage with its content and tend to be exposed to RT’s content alongside other mainstream news channels. This indicates that RT is not a central part of their online news media environment. Second, using probabilistic computational methods, we show that followers of RT are slightly more likely to be older and male than average Twitter users, and they are far more likely to be bots. Third, we identify thirty-five distinct audience segments, which vary in terms of their nationality, languages, and interests. This audience segmentation reveals the considerable heterogeneity of RT’s Twitter followers. Accordingly, we conclude that generalizations about RT’s audience based on analyses of RT’s media content, or on vocal minorities among its wider audiences, are unhelpful and limit our understanding of RT and its appeal to international audiences.